
possible to keep some of the earlier layers fixed (due to overfitting concerns) and only fine-tune
some higher-level portion of the network. This is motivated by the observation that the earlier
layers of a ConvNet contain more generic features (e.g. edge detectors or color blob detectors)
that should be useful to many tasks, while later layers of the ConvNet become progressively more
specific to the details of the classes contained in the original dataset. In the case of ImageNet, for
example, which contains many dog breeds, a significant portion of the representational power of the
ConvNet may be devoted to features that are specific to differentiating between dog breeds.
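As a minimal sketch of keeping earlier layers fixed while fine-tuning the higher-level portion, the toy update step below skips any layer marked as frozen. The layer names, weights, gradients, and the `sgd_step` helper are all hypothetical and only illustrate the idea; they do not correspond to any particular library.

```python
def sgd_step(layers, grads, lr, frozen):
    """Apply one plain SGD update in place, skipping layers whose names are in `frozen`."""
    for layer in layers:
        name = layer["name"]
        if name in frozen:
            continue  # frozen layer: the pretrained weights are left untouched
        layer["weights"] = [w - lr * g for w, g in zip(layer["weights"], grads[name])]


# Hypothetical toy network: two "pretrained" layers and a new classifier layer.
layers = [
    {"name": "conv1", "weights": [0.5, -0.2]},
    {"name": "conv2", "weights": [0.1, 0.3]},
    {"name": "fc", "weights": [0.0, 0.0]},
]
# Made-up gradients, one list per layer, purely for illustration.
grads = {"conv1": [1.0, 1.0], "conv2": [1.0, -1.0], "fc": [2.0, -2.0]}

# Keep the earliest layer fixed; fine-tune everything above it.
sgd_step(layers, grads, lr=0.1, frozen={"conv1"})
```

After the call, the weights of `conv1` are unchanged, while `conv2` and `fc` have moved against their gradients.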
Pretrained models. Since modern ConvNets take 2-3 weeks to train across multiple GPUs on
ImageNet, it is common to see people release their final ConvNet checkpoints for the benefit of
others who can use the networks for fine-tuning. For example, the Caffe library has a Model
Zoo where people share their network weights.
When and how to fine-tune? How do you decide what type of transfer learning you should
perform on a new dataset? This is a function of several factors, but the two most important ones
are the size of the new dataset (small or big), and its similarity to the original dataset (e.g.
ImageNet-like in terms of the content of images and the classes, or very different, such as
microscope images). Keeping in mind that ConvNet features are more generic in early layers
and more original-dataset-specific in later layers, here are some common rules of thumb for
navigating the 4 major scenarios:
1. New dataset is small and similar to original dataset. Since the data is small, it is not a good idea to
fine-tune the ConvNet due to overfitting concerns. Since the data is similar to the original data, we
expect higher-level features in the ConvNet to be relevant to this dataset as well. Hence, the best
idea might be to train a linear classifier on the CNN codes.
2. New dataset is large and similar to the original dataset. Since we have more data, we can have more
confidence that we won’t overfit if we were to try to fine-tune through the full network.
3. New dataset is small but very different from the original dataset. Since the data is small, it is likely
best to only train a linear classifier. Since the dataset is very different, it might not be best to train
the classifier from the top of the network, which contains more dataset-specific features. Instead, it
might work better to train a linear classifier (e.g. an SVM) on activations from somewhere earlier in the network.
4. New dataset is large and very different from the original dataset. Since the dataset is very large, we
may expect that we can afford to train a ConvNet from scratch. However, in practice it is very often
still beneficial to initialize with weights from a pretrained model. In this case, we would have enough
data and confidence to fine-tune through the entire network.
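The four rules of thumb above can be summarized as a small lookup. This is only a sketch: the function name `choose_strategy` and its boolean arguments are hypothetical, and deciding what counts as "small" or "similar" is left to the practitioner, since the text gives no numeric thresholds.

```python
def choose_strategy(is_small, is_similar):
    """Map the two factors (dataset size, similarity to the original data) to a rule of thumb."""
    if is_small and is_similar:
        # Scenario 1: fine-tuning risks overfitting; top-layer features are still relevant.
        return "train a linear classifier on the CNN codes"
    if not is_small and is_similar:
        # Scenario 2: enough data to fine-tune without much overfitting risk.
        return "fine-tune through the full network"
    if is_small:
        # Scenario 3: small and different; top-layer features are too dataset-specific.
        return "train a linear classifier on activations from an earlier layer"
    # Scenario 4: large and different; pretrained initialization is still often beneficial.
    return "initialize from a pretrained model and fine-tune the entire network"


for small in (True, False):
    for similar in (True, False):
        print(f"small={small}, similar={similar}: {choose_strategy(small, similar)}")
```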
Practical advice. There are a few additional things to keep in mind when performing Transfer
Learning:
Constraints from pretrained models. Note that if you wish to use a pretrained network, you may be
slightly constrained in terms of the architecture you can use for your new dataset. For example, you
can’t arbitrarily take out Conv layers from the pretrained network. However, some changes are
straightforward: due to parameter sharing, you can easily run a pretrained network on images of
different spatial size. This is clearly evident in the case of Conv/Pool layers because their forward
function is independent of the input volume spatial size (as long as the strides “fit”). In the case of FC
layers, this still holds true because FC layers can be converted to Convolutional Layers: for example,
in an AlexNet, the final pooling volume before the first FC layer is of size [6x6x256]. Therefore, the FC
layer looking at this volume is equivalent to a Convolutional Layer that has a receptive field size of
6x6 and is applied with padding of 0.
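The FC-to-Conv equivalence can be checked numerically with a small sketch. The sizes below (a 2x2 receptive field, depth 3, 4 output units) are toy stand-ins chosen for speed, not the real AlexNet dimensions, and the helper functions are hypothetical and written in plain Python rather than taken from any library.

```python
import random


def fc_forward(volume, weights):
    """Fully connected layer: one dot product per output unit over the whole volume."""
    flat = [v for row in volume for cell in row for v in cell]
    return [sum(w * x for w, x in zip(w_row, flat)) for w_row in weights]


def conv_forward(volume, weights, field):
    """Convolution with stride 1 and padding 0, using a field x field receptive field.

    Each filter is stored flattened in the same (row, column, channel) order
    as the FC weights, so the same weights serve both layers.
    """
    height, width = len(volume), len(volume[0])
    out = []
    for i in range(height - field + 1):
        out_row = []
        for j in range(width - field + 1):
            patch = [v for r in range(i, i + field)
                     for c in range(j, j + field)
                     for v in volume[r][c]]
            out_row.append([sum(w * x for w, x in zip(w_row, patch)) for w_row in weights])
        out.append(out_row)
    return out


def random_volume(height, width, depth, rng):
    return [[[rng.uniform(-1, 1) for _ in range(depth)]
             for _ in range(width)] for _ in range(height)]


rng = random.Random(0)
FIELD, DEPTH, NUM_OUT = 2, 3, 4  # toy sizes (assumption), standing in for the 6x6 case
weights = [[rng.uniform(-1, 1) for _ in range(FIELD * FIELD * DEPTH)]
           for _ in range(NUM_OUT)]

# On a volume exactly the size of the receptive field, both layers agree.
small = random_volume(FIELD, FIELD, DEPTH, rng)
fc_out = fc_forward(small, weights)
conv_out = conv_forward(small, weights, FIELD)  # a 1x1 grid of NUM_OUT scores

# On a larger volume, the converted layer yields a spatial grid of scores.
large = random_volume(3, 4, DEPTH, rng)
conv_large = conv_forward(large, weights, FIELD)  # (3-2+1) x (4-2+1) = 2x3 grid
```

On an input whose spatial size matches the receptive field, the converted layer reproduces the FC output at a single position. On a larger input, the same weights slide over the volume and produce one score vector per position, which is why a pretrained network can be run on images of a different spatial size.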
Learning rates. It’s common to use a smaller learning rate for ConvNet weights that are being
fine-tuned, in comparison to the (randomly initialized) weights for the new linear classifier that computes
the class scores of your new dataset. This is because we expect that the ConvNet weights are